Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Among the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery and processes involved in wind energy production using machine learning, and has collected sensor data on generator failures in wind turbines. They have shared a ciphered version of the data, as data collected through sensors is confidential (the type of data collected varies by company). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one to help identify failures so that generators can be repaired before they fail or break, reducing the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable should be considered as “failure” and “0” represents “no failure”.
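The cost asymmetry described above (replacement > repair > inspection) can be made concrete with a toy calculation. The unit costs below are hypothetical, purely for illustration; the actual figures are not given in the problem statement:

```python
# Hypothetical unit costs (illustration only; replacement >> repair > inspection)
COST_REPLACE = 40_000  # generator fails in service (false negative)
COST_REPAIR = 15_000   # true failure caught in advance (true positive)
COST_INSPECT = 5_000   # needless inspection of a healthy generator (false positive)

def maintenance_cost(tn, fp, fn, tp):
    """Total cost implied by confusion-matrix counts under the assumed unit costs."""
    return fp * COST_INSPECT + fn * COST_REPLACE + tp * COST_REPAIR

# Two hypothetical models on 1000 generators with 60 true failures:
# model A misses 20 failures; model B misses only 5 but inspects more healthy units.
cost_a = maintenance_cost(tn=920, fp=20, fn=20, tp=40)
cost_b = maintenance_cost(tn=890, fp=50, fn=5, tp=55)
print(cost_a, cost_b)  # model B is cheaper despite more false positives
```

Under these assumed costs, missed failures (false negatives) dominate the total, which is why the modeling below prioritizes recall.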
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# set size and style of the seaborn plots
sns.set_style("darkgrid")
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.4f" % x)
# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
roc_auc_score,
roc_curve,
precision_recall_curve,
confusion_matrix,
classification_report,
make_scorer,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# mounting the data from Google Drive
from google.colab import drive
drive.mount('/content/drive')
# loading the dataset(s)
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projects/Project 6/Train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projects/Project 6/Test.csv')
# copying the training data to another variable to avoid any changes to original data
data = df.copy()
# let's create a copy of the testing data
data_test = df_test.copy()
# training dataset
data.head()
data.tail()
# test dataset
data_test.head()
data_test.tail()
# Checking the number of rows and columns in the training data
data.shape
# Checking the number of rows and columns in the test data
data_test.shape
# checking the train dataset
data.duplicated().sum()
# checking the test dataset
data_test.duplicated().sum()
data.dtypes #train dataset
# let's check the count and data types of the columns in the train dataset
data.info()
# let's check the count and data types of the columns in the test dataset
data_test.info()
# let's check for missing values in the train data
data.isnull().sum()
# let's check for missing values in the test data
data_test.isnull().sum()
Data Overview Observations
# let's view the statistical summary of the numerical columns in the train data
data.describe().T
Plotting histograms and boxplots for all the variables
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default True)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
for feature in data.columns:
    histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None)
Summary Observations
# “1” in the target variables is “failure” and “0” represents “No failure”.
data["Target"].value_counts(1)
Observations
# “1” in the target variables is “failure” and “0” represents “No failure”.
data_test["Target"].value_counts()
Observations on Test Data
# let's plot a correlation heatmap to examine the correlations with the target variable
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(30, 30))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="coolwarm"
)
plt.show()
Looking at the Target variable's correlations, we can observe a strong positive correlation with the sensor variables V15, V7, V16, V21, and V28.
To prevent data leakage, it is generally recommended to split the data into training and validation sets before performing any imputation or preprocessing steps.
The split should be performed on the Train.csv dataset to create separate training and validation sets. The Test.csv dataset should be kept as a standalone dataset for final evaluation after model development.
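One way to make the "split first, then preprocess" rule hard to violate is to wrap the preprocessing and the model in a single sklearn Pipeline, so that cross-validation refits the imputer on each fold's training portion only. A minimal sketch, with the estimator choice purely illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# the imputer is (re)fitted inside each CV fold, so fold statistics never leak
leak_free = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("clf", LogisticRegression(random_state=1)),
    ]
)
```

Passing `leak_free` to `cross_val_score` then guarantees the validation portion of each fold is only ever transformed, never fitted on.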
# Separating target variable and other variables into X and y
X = data.drop(["Target"], axis=1)
y = data["Target"]
# Splitting data into training and validation set:
# test_size parameter specifies the proportion of data to be allocated for validation (in this case, 20% of the data)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y)
# Checking the number of rows and columns in the X_train data
# Checking the number of rows and columns in the X_val data
print(X_train.shape, X_val.shape)
Observations
# Dividing test data into X_test and y_test
# drop target variable from test data
# store target variable in y_test
# note, we do not need to split the test data, as it should be kept as a standalone dataset for final evaluation
X_test = data_test.drop(["Target"], axis = 1)
y_test = data_test["Target"]
# Checking the number of rows and columns in the X_test data
X_test.shape
'''
A single SimpleImputer instance with strategy="median" computes the median
separately for each column it is fitted on, so one instance can impute every
column with missing values.
To prevent leakage between the Train and Test datasets, the imputer is fitted
on the training data only and then applied to the validation and test data.
Refer to Appendix Student Notes : Imputer
'''
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
cols_to_impute = ["V1", "V2"]
# fit and transform imputer on train data
X_train[cols_to_impute] = imp_median.fit_transform(X_train[cols_to_impute])
# Transform on validation and test data
X_val[cols_to_impute] = imp_median.transform(X_val[cols_to_impute])
# transform the test data with the imputer fitted on the train data
X_test[cols_to_impute] = imp_median.transform(X_test[cols_to_impute])
# Validate that no columns have missing values in train or test sets
print('Count of missing values in train set')
print(X_train.isna().sum())
print("-" * 30)
print()
print('Count of missing values in validation set')
print(X_val.isna().sum())
print("-" * 30)
print()
print('Count of missing values in test set')
print(X_test.isna().sum())
Observations
The nature of predictions made by the classification model will translate as follows:
Which metric to optimize?
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score, greater_is_better=True)
We are now done with pre-processing and evaluation criterion, so let's start building the model.
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1, eval_metric="logloss")))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Based on the provided cross-validation and validation performance scores, we can analyze the performance of different models on the training dataset and the validation dataset.
*Based on these results, XGBoost seems to be the most effective model for this task, as it achieves the highest performance on both the cross-validation and validation datasets.*
Student Note : If a model performs significantly better on the training dataset than on the validation dataset, it suggests overfitting
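The student note above can be turned into a simple check: compare each model's mean cross-validated recall against its validation recall and flag large gaps. A sketch, where the 0.05 threshold is an arbitrary cut-off chosen for illustration, not a standard:

```python
def overfit_gap(cv_recall_mean, val_recall, threshold=0.05):
    """Flag a model whose CV recall exceeds validation recall by more than
    `threshold` (an arbitrary cut-off chosen for illustration)."""
    gap = cv_recall_mean - val_recall
    return gap, gap > threshold

print(overfit_gap(0.90, 0.72))  # large gap -> likely overfitting
print(overfit_gap(0.86, 0.85))  # small gap -> generalizing well
```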
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Observations
# check the count of target variable before oversampling
# count the instances with label '1' and '0'
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique (SMOTE)
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
#Check the count of target variables after oversampling
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
#give the number of instances (rows) and the number of features (columns) in the oversampled training data
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
# represents the target variable after applying oversampling
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Observation
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1, eval_metric="logloss")))
results2 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset::" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results2.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Based on the provided cross-validation scores and validation performance for both the original data and the oversampled data model, we can make the following observations:
Original Data Model:
Oversampled Data Model:
Conclusion :
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison Oversampled Model")
ax = fig.add_subplot(111)
plt.boxplot(results2)
ax.set_xticklabels(names)
plt.show()
Based on the boxplot comparing the cross-validation scores and validation performance of the oversampled models, we can make the following inferences :
Logistic Regression: The boxplot shows that the cross-validation scores for logistic regression are relatively higher than for the original model, indicating consistent performance across different folds.
Bagging, Decision Tree, Random Forest: These models exhibit similar boxplots, with relatively narrow IQRs, indicating lower variability across folds. The validation performance is slightly lower than the cross-validation scores, which suggests these models might be slightly overfitting the training data (learning the specific patterns and noise in the training data a little too well, resulting in lower performance on new, unseen data).
XGBoost: The boxplot for XGBoost shows the highest median cross-validation score among all models, indicating consistently high performance across different folds.
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Student Note : It's important to note that under-sampling reduces the amount of available training data, which can lead to a loss of information and potentially affect the model's overall performance.
Both over-sampling and under-sampling techniques aim to address class imbalance issue : Over-sampling increases the number of minority class samples, while under-sampling reduces the number of majority class samples.
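A third option, used later via the class_weight parameter in the Random Forest grid, re-weights the loss instead of resampling. sklearn's 'balanced' heuristic assigns each class a weight of n_samples / (n_classes * count(class)); a pure-Python sketch of that formula:

```python
from collections import Counter

def balanced_weights(y):
    """Replicates sklearn's class_weight='balanced' formula:
    weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# toy imbalanced target: 90 negatives, 10 positives
y_toy = [0] * 90 + [1] * 10
print(balanced_weights(y_toy))  # minority class gets a ~9x larger weight
```

Unlike SMOTE or undersampling, re-weighting keeps every training row, at the cost of relying on the model supporting per-class weights.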
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1, eval_metric="logloss")))
results3 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results3.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison with Undersampled Model")
ax = fig.add_subplot(111)
plt.boxplot(results3)
ax.set_xticklabels(names)
plt.show()
Comparing the undersampled model to the original model, we can observe the following:
Cross-validation performance: The undersampled model generally showed higher cross-validation performance across all the evaluated algorithms compared to the original model. This indicates that the undersampled model performed better on the training data when using cross-validation.
Validation performance: The undersampled model also generally showed higher validation performance compared to the original model. This suggests that the undersampled model performs better when evaluated on the validation set, which provides an indication of its generalization ability.
To select the three best-performing models based on the goal of reducing false negatives and maximizing the recall score, we can compare the recall scores obtained from both the cross-validation performance and the validation performance for each model.
Based on the provided recall scores, the three best-performing models and their scores are:
1. XGBoost:
2. Random Forest:
3. Gradient Boosting:
XGBClassifier().get_params()
RandomForestClassifier().get_params()
GradientBoostingClassifier().get_params()
%%time
# defining model - RandomForest Hyperparameter Tuning
model_rf = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid_rf = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(5, 10),  # larger values can help prevent overfitting
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # fractions and 'sqrt' as separate candidates (an array is not a valid single value)
    "max_samples": np.arange(0.4, 0.7, 0.1),
    "class_weight": ["balanced", "balanced_subsample"],  # suitable for handling imbalanced datasets
    "min_impurity_decrease": [0.001, 0.002, 0.003],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_rf, param_distributions=param_grid_rf, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
## fit RandomizedSearchCV on the original training dataset (X_train, y_train)
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters - Original Data
tuned_rf = RandomForestClassifier(
n_estimators= 250, min_samples_leaf= 8, min_impurity_decrease= 0.003, max_samples= 0.6,
max_features= "sqrt", class_weight= "balanced_subsample",random_state=1,
)
tuned_rf.fit(X_train, y_train)
#Checking the performance on the training set - Original dataset
print("Recall on train set")
tuned_rf_perf = model_performance_classification_sklearn(
tuned_rf, X_train, y_train
)
tuned_rf_perf
#Checking the performance on the validation set - Original dataset
print("Recall on validation set")
tuned_rf_val = model_performance_classification_sklearn(tuned_rf, X_val, y_val)
tuned_rf_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_rf, X_val, y_val)
%%time
# defining model - RandomForest Hyperparameter Tuning
model_rf = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid_rf = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(5, 10),  # larger values can help prevent overfitting
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # fractions and 'sqrt' as separate candidates (an array is not a valid single value)
    "max_samples": np.arange(0.4, 0.7, 0.1),
    "class_weight": ["balanced", "balanced_subsample"],  # suitable for handling imbalanced datasets
    "min_impurity_decrease": [0.001, 0.002, 0.003],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_rf, param_distributions=param_grid_rf, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
## fit the model on over sampled data training dataset (X_train_over, y_train_over)
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters - oversampled data
tuned_rf1 = RandomForestClassifier(
n_estimators= 250, min_samples_leaf= 5, min_impurity_decrease= 0.001, max_samples= 0.5,
max_features= "sqrt", class_weight= "balanced_subsample",random_state=1,
)
tuned_rf1.fit(X_train_over, y_train_over)
#Checking the performance on the oversampled training set
print("Recall on train set")
tuned_rf_os_perf = model_performance_classification_sklearn(
tuned_rf1, X_train_over, y_train_over
)
tuned_rf_os_perf
#Checking the performance on the validation set - oversampled dataset
print("Recall on validation set")
tuned_rf_os_val = model_performance_classification_sklearn(tuned_rf1, X_val, y_val)
tuned_rf_os_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_rf1, X_val, y_val)
%%time
# defining model - RandomForest Hyperparameter Tuning
model_rf = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid_rf = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(5, 10),  # larger values can help prevent overfitting
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # fractions and 'sqrt' as separate candidates (an array is not a valid single value)
    "max_samples": np.arange(0.4, 0.7, 0.1),
    "class_weight": ["balanced", "balanced_subsample"],  # suitable for handling imbalanced datasets
    "min_impurity_decrease": [0.001, 0.002, 0.003],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_rf, param_distributions=param_grid_rf, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
## fit RandomizedSearchCV on the undersampled training dataset (X_train_un, y_train_un)
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters - undersampled data
tuned_rf2 = RandomForestClassifier(
n_estimators= 200, min_samples_leaf= 5, min_impurity_decrease= 0.001, max_samples= 0.5,
max_features= "sqrt", class_weight= "balanced_subsample",random_state=1,
)
tuned_rf2.fit(X_train_un, y_train_un)
#Checking the performance on the undersampled training set
print("Recall on train set")
tuned_rf_us_perf = model_performance_classification_sklearn(
tuned_rf2, X_train_un, y_train_un
)
tuned_rf_us_perf
#Checking the performance on the validation set - undersampled dataset
print("Recall on validation set")
tuned_rf_us_val = model_performance_classification_sklearn(tuned_rf2, X_val, y_val)
tuned_rf_us_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_rf2, X_val, y_val)
Observations and Inference
Conclusion :
%%time
# defining model - Gradient Boosting Tuning
model_gb = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid_gb = {
    "n_estimators": [100, 200, 250],  # number of trees to build in the ensemble
    "learning_rate": [0.01, 0.1, 0.5],  # relatively low rates included for better generalization and less overfitting
    "subsample": [0.7, 0.8, 1.0],  # fraction of samples used for training each tree
    "max_depth": [3, 5, 7],  # maximum depth of the individual trees
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # smaller fractions reduce overfitting risk on this 20,000-row dataset (an array is not a valid single value)
    "min_samples_split": [2, 5, 10],  # minimum number of samples required to split a node
    "min_samples_leaf": [1, 2, 4],  # minimum number of samples required at each leaf node
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_gb, param_distributions=param_grid_gb, scoring=scorer, n_iter=50, n_jobs = -1, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters
tuned_gb = GradientBoostingClassifier(
subsample = 1.0, n_estimators = 250, min_samples_split = 10, min_samples_leaf =2,
max_features ='sqrt', max_depth = 5, learning_rate =0.1, random_state =1,
)
tuned_gb.fit(X_train, y_train)
#Checking the performance on the original training set
print("Recall on train set")
tuned_gb_perf = model_performance_classification_sklearn(
tuned_gb, X_train, y_train
)
tuned_gb_perf
#Checking the performance on the original validation set
print("Recall on validation set")
tuned_gb_val = model_performance_classification_sklearn(tuned_gb, X_val, y_val)
tuned_gb_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_gb, X_val, y_val)
%%time
# defining model - Gradient Boosting Tuning
model_gb = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid_gb = {
    "n_estimators": [100, 200, 250],  # number of trees to build in the ensemble
    "learning_rate": [0.01, 0.1, 0.5],  # relatively low rates included for better generalization and less overfitting
    "subsample": [0.7, 0.8, 1.0],  # fraction of samples used for training each tree
    "max_depth": [3, 5, 7],  # maximum depth of the individual trees
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # smaller fractions reduce overfitting risk on this 20,000-row dataset (an array is not a valid single value)
    "min_samples_split": [2, 5, 10],  # minimum number of samples required to split a node
    "min_samples_leaf": [1, 2, 4],  # minimum number of samples required at each leaf node
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_gb, param_distributions=param_grid_gb, scoring=scorer, n_iter=50, n_jobs = -1, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters
tuned_gb1 = GradientBoostingClassifier(
subsample = 1.0, n_estimators = 250, min_samples_split = 10, min_samples_leaf =2,
max_features ='sqrt', max_depth = 5, learning_rate =0.5, random_state =1,
)
tuned_gb1.fit(X_train_over, y_train_over)
#Checking the performance on the training set - oversampled data
tuned_gb_os_perf = model_performance_classification_sklearn(
tuned_gb1, X_train_over, y_train_over
)
tuned_gb_os_perf
#Checking the performance on the validation set - oversampled data
tuned_gb_os_val = model_performance_classification_sklearn(tuned_gb1, X_val, y_val)
tuned_gb_os_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_gb1, X_val, y_val)
%%time
# defining model - Gradient Boosting Tuning
model_gb = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid_gb = {
    "n_estimators": [100, 200, 250],  # number of trees to build in the ensemble
    "learning_rate": [0.01, 0.1, 0.5],  # relatively low rates included for better generalization and less overfitting
    "subsample": [0.7, 0.8, 1.0],  # fraction of samples used for training each tree
    "max_depth": [3, 5, 7],  # maximum depth of the individual trees
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # smaller fractions reduce overfitting risk on this 20,000-row dataset (an array is not a valid single value)
    "min_samples_split": [2, 5, 10],  # minimum number of samples required to split a node
    "min_samples_leaf": [1, 2, 4],  # minimum number of samples required at each leaf node
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_gb, param_distributions=param_grid_gb, scoring=scorer, n_iter=50, n_jobs = -1, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters
tuned_gb2 = GradientBoostingClassifier(
subsample = 0.7, n_estimators = 200, min_samples_split = 2, min_samples_leaf =4,
max_features ='sqrt', max_depth = 5, learning_rate =0.5, random_state =1,
)
tuned_gb2.fit(X_train_un, y_train_un)
#Checking the performance on the training set - undersampled data
tuned_gb_us_perf = model_performance_classification_sklearn(
tuned_gb2, X_train_un, y_train_un
)
tuned_gb_us_perf
#Checking the performance on the validation set - undersampled data
print("Performance on validation set")
tuned_gb_us_val = model_performance_classification_sklearn(tuned_gb2, X_val, y_val)
tuned_gb_us_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_gb2, X_val, y_val)
Observations
%%time
# defining model - XGBoost Hyperparameter Tuning
model_xgb = XGBClassifier(random_state=1,eval_metric='logloss')
# using tree_method='gpu_hist' to speed up training (CPU runs took >56 min)
param_grid_xgb = {
    "tree_method": ["gpu_hist"],
    "n_estimators": [150, 200, 250],  # range of ensemble sizes to explore
    "scale_pos_weight": [1, 2],  # useful for handling class imbalance
    "subsample": [0.9, 1],  # fraction of samples used for training each tree
    "learning_rate": [0.01, 0.1, 0.5],  # low to high rates, to balance model complexity against performance
    "gamma": [0, 3, 5],  # minimum loss reduction required to make a split
    "colsample_bytree": [0.8, 0.9],  # fraction of features randomly sampled for each tree
    "colsample_bylevel": [0.9, 1],  # and for each tree level
}
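The grid above only tries `scale_pos_weight` values of 1 and 2; XGBoost's documentation suggests the ratio of negative to positive samples as a starting point. A hedged sketch on hypothetical labels (the ~5% failure rate is an illustrative assumption):

```python
import numpy as np

# Hypothetical imbalanced labels: ~5% failures
y = np.array([0] * 950 + [1] * 50)

neg, pos = np.bincount(y)
scale_pos_weight = neg / pos  # XGBoost's suggested starting point: sum(neg) / sum(pos)
print(f"negatives={neg}, positives={pos}, scale_pos_weight={scale_pos_weight:.1f}")
```
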
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_xgb, param_distributions=param_grid_xgb, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters
tuned_xgb = XGBClassifier(
tree_method ='gpu_hist',
subsample = 1,
scale_pos_weight = 2,
n_estimators = 250,
learning_rate = 0.1,
gamma = 0,
colsample_bytree = 0.9,
colsample_bylevel = 1,
random_state =1
)
tuned_xgb.fit(X_train, y_train)
#Checking the performance on the original training set
print("Performance on training set")
tuned_xgb_perf = model_performance_classification_sklearn(
tuned_xgb, X_train, y_train
)
tuned_xgb_perf
#Checking the performance on the original validation set
print("Performance on validation set")
tuned_xgb_val = model_performance_classification_sklearn(tuned_xgb, X_val, y_val)
tuned_xgb_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_xgb, X_val, y_val)
%%time
# defining model - XGBoost Hyperparameter Tuning
model_xgb = XGBClassifier(random_state=1,eval_metric='logloss')
param_grid_xgb = {
    "tree_method": ["gpu_hist"],
    "n_estimators": [150, 200, 250],  # range of ensemble sizes to explore
    "scale_pos_weight": [1, 2],  # useful for handling class imbalance
    "subsample": [0.9, 1],  # fraction of samples used for training each tree
    "learning_rate": [0.01, 0.1, 0.5],  # low to high rates, to balance model complexity against performance
    "gamma": [0, 3, 5],  # minimum loss reduction required to make a split
    "colsample_bytree": [0.8, 0.9],  # fraction of features randomly sampled for each tree
    "colsample_bylevel": [0.9, 1],  # and for each tree level
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_xgb, param_distributions=param_grid_xgb, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters
tuned_xgb1 = XGBClassifier(
tree_method ='gpu_hist',
subsample = 0.9,
scale_pos_weight = 2,
n_estimators = 150,
learning_rate = 0.5,
gamma = 0,
colsample_bytree = 0.8,
colsample_bylevel = 1,
random_state =1
)
tuned_xgb1.fit(X_train_over, y_train_over)
#Checking the performance on the training set - oversampled data
tuned_xgb_os_perf = model_performance_classification_sklearn(tuned_xgb1, X_train_over, y_train_over)
tuned_xgb_os_perf
#Checking the performance on the validation set - oversampled data
tuned_xgb_os_val = model_performance_classification_sklearn(tuned_xgb1, X_val, y_val)
tuned_xgb_os_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_xgb1, X_val, y_val)
%%time
# defining model - XGBoost Hyperparameter Tuning
model_xgb = XGBClassifier(random_state=1,eval_metric='logloss')
param_grid_xgb = {
    "tree_method": ["gpu_hist"],
    "n_estimators": [150, 200, 250],  # range of ensemble sizes to explore
    "scale_pos_weight": [1, 2],  # useful for handling class imbalance
    "subsample": [0.9, 1],  # fraction of samples used for training each tree
    "learning_rate": [0.01, 0.1, 0.5],  # low to high rates, to balance model complexity against performance
    "gamma": [0, 3, 5],  # minimum loss reduction required to make a split
    "colsample_bytree": [0.8, 0.9],  # fraction of features randomly sampled for each tree
    "colsample_bylevel": [0.9, 1],  # and for each tree level
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model_xgb, param_distributions=param_grid_xgb, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Let's build a model with obtained best parameters
tuned_xgb2 = XGBClassifier(
tree_method ='gpu_hist',
subsample = 0.9,
scale_pos_weight = 2,
n_estimators = 200,
learning_rate = 0.1,
gamma = 3,
colsample_bytree = 0.8,
colsample_bylevel = 1,
random_state =1
)
tuned_xgb2.fit(X_train_un, y_train_un)
#Checking the performance on the training set - undersampled data
tuned_xgb_us_perf = model_performance_classification_sklearn(
tuned_xgb2, X_train_un, y_train_un
)
tuned_xgb_us_perf
#Checking the performance on the validation set - undersampled data
print("Performance on validation set")
tuned_xgb_us_val = model_performance_classification_sklearn(tuned_xgb2, X_val, y_val)
tuned_xgb_us_val
# creating confusion matrix
confusion_matrix_sklearn(tuned_xgb2, X_val, y_val)
Observations
Overall, the XGBoost model has demonstrated strong performance across all data sets, effectively identifying failures with high recall and maintaining a good balance between true positives and false positives. The model performs particularly well on the original and oversampled data, while its performance is slightly compromised on the undersampled data.
We have now tuned all the models. Let's compare the performance of all tuned models and see which one is the best.
# training performance comparison
models_train_comp_df = pd.concat(
[
tuned_rf_perf.T,
tuned_rf_os_perf.T,
tuned_rf_us_perf.T,
tuned_gb_perf.T,
tuned_gb_os_perf.T,
tuned_gb_us_perf.T,
tuned_xgb_perf.T,
tuned_xgb_os_perf.T,
tuned_xgb_us_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Random Forest Tuned with Original Data",
"Random Forest Tuned with Oversampled Data",
"Random Forest Tuned with Undersampled Data",
"Gradient Boosting Tuned with Original Data",
"Gradient Boosting Tuned with Oversampled Data",
"Gradient Boosting Tuned with Undersampled Data",
"XGBoost Tuned with Original Data",
"XGBoost Tuned with Oversampled Data",
"XGBoost Tuned with Undersampled Data",
]
print("Training Performance Comparison:")
models_train_comp_df
# validation performance comparison
models_val_comp_df = pd.concat(
[
tuned_rf_val.T,
tuned_rf_os_val.T,
tuned_rf_us_val.T,
tuned_gb_val.T,
tuned_gb_os_val.T,
tuned_gb_us_val.T,
tuned_xgb_val.T,
tuned_xgb_os_val.T,
tuned_xgb_us_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Random Forest Tuned with Original Data",
"Random Forest Tuned with Oversampled Data",
"Random Forest Tuned with Undersampled Data",
"Gradient Boosting Tuned with Original Data",
"Gradient Boosting Tuned with Oversampled Data",
"Gradient Boosting Tuned with Undersampled Data",
"XGBoost Tuned with Original Data",
"XGBoost Tuned with Oversampled Data",
"XGBoost Tuned with Undersampled Data",
]
print("Validation Performance Comparison:")
models_val_comp_df
Observations
Based on the goal of maximizing the prediction of generator failures and minimizing false negatives (missed failures), the evaluation metric of interest is Recall.
XGBoost (Undersampled Data) achieves the highest Recall value of 0.9009. Therefore, based on the given performance comparison and the goal of maximizing Recall, XGBoost tuned with undersampled data is the recommended final model for predicting generator failures on the test data.
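As a quick illustration of the metric being maximized, recall is the fraction of actual failures the model flags, TP / (TP + FN). A minimal sketch on hypothetical labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 4 real failures, of which the model catches 3
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# recall = TP / (TP + FN) = 3 / (3 + 1) = 0.75
r = recall_score(y_true, y_pred)
print(r)
```
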
# Let's check the performance on test set
print("XGBoost Tuned with Undersampled Test Data Performance:")
xgb_grid_test = model_performance_classification_sklearn(tuned_xgb2, X_test, y_test)
xgb_grid_test
confusion_matrix_sklearn(tuned_xgb2, X_test, y_test)
Observations
From the performance of the final XGBoost model on the test set, we can make the following inferences:
Accuracy: The final XGBoost model achieves an accuracy of 0.9206 on the test set. This indicates that the model correctly predicts approximately 92.06% of the instances in the test set.
Recall: The recall of the final XGBoost model is 0.8830, indicating that it correctly identifies approximately 88.30% of the actual generator failures in the test set. This is an improvement compared to the recall values of the other models.
Precision: The precision of the final XGBoost model is 0.4062, which means that out of all the instances predicted as failures by the model, only approximately 40.62% of them are actually true positives. This is a lower precision value compared to some of the other models, indicating a higher number of false positives.
F1 Score: The F1 score of the final XGBoost model is 0.5564, the harmonic mean of precision and recall. It provides a balanced measure of the model's overall performance, with a higher value indicating a better trade-off between the two metrics.
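As a sanity check, the reported F1 score follows directly from the reported precision and recall via the harmonic-mean formula F1 = 2PR / (P + R):

```python
precision = 0.4062  # reported test-set precision
recall = 0.8830     # reported test-set recall

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5564, matching the reported F1 score
```
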
Conclusion: The tuned XGBoost model trained on undersampled data has generalized well to the test data.
feature_names = X_train.columns
importances = tuned_xgb2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
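The same importances can also be read off numerically. The sketch below uses a small sklearn model on synthetic data as a stand-in for `tuned_xgb2` and `X_train`, since the pattern only needs `feature_importances_` and the column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real training data and tuned model
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X = pd.DataFrame(X, columns=[f"V{i}" for i in range(1, 11)])
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Top 5 features by importance, sorted descending
top5 = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
    .head(5)
)
print(top5)
```
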
Observations
#Building a pipeline model with the best model
model_pipeline = Pipeline(
steps=[
(
"XGBoost Tuned with Undersampled Data",
XGBClassifier(
tree_method ='gpu_hist',
subsample = 0.9,
scale_pos_weight = 2,
n_estimators = 200,
learning_rate = 0.1,
gamma = 3,
colsample_bytree = 0.8,
colsample_bylevel = 1,
random_state =1
),
),
]
)
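Since the raw features contain missing values, a more self-contained variant would fold imputation into the same pipeline, so train and test sets are imputed with identical, train-fitted medians. A sketch using sklearn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost step:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("model", GradientBoostingClassifier(random_state=1)),
    ]
)

# Tiny hypothetical data with a missing entry; fit() learns the medians,
# and predict() reuses them, so no separate imputation step is needed.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, 4.0], [3.0, 1.0]] * 5)
y = np.array([0, 1, 0, 1] * 5)
pipe.fit(X, y)
preds = pipe.predict(X[:2])
print(preds)
```
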
# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]
# Since we already have a separate test set, we don't need to divide data into train and test
X_test1 = df_test.drop(columns="Target") #drop target variable from test data
y_test1 = df_test["Target"] #store target variable in y_test1
# We can't undersample data without doing missing value treatment,
# so let's first treat the missing values in the train set
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)
# treating the missing values in the test set, reusing the medians learned from the train set
X_test1 = imputer.transform(X_test1)
# code for undersampling on the data
# Using RandomUnderSampler technique
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X1, Y1)
# fit the pipeline model obtained from above step
model_pipeline.fit(X_train_un, y_train_un)
#Checking the performance of the pipeline on test set
model_pipeline_test = model_performance_classification_sklearn(model_pipeline, X_test1,y_test1)
model_pipeline_test
Observations
In terms of handling missing values in the train and test datasets, the imputer should be fitted on the training data only and then reused to transform the test data, so that test-set statistics do not leak into the preprocessing.
If we need to check which columns contain missing values in each dataset, we can create the imputer with imp_median = SimpleImputer(strategy='median') and identify the affected columns, e.g. for the train set:
train_cols_with_missing = X_train.columns[X_train.isnull().any()].tolist()
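Completing that idea as a small self-contained sketch (toy frames stand in for the real X_train/X_test): the imputer is fitted on the training data only, and its learned medians are reused on the test set:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames with missing values in different places
X_train = pd.DataFrame({"V1": [1.0, np.nan, 3.0], "V2": [4.0, 5.0, np.nan]})
X_test = pd.DataFrame({"V1": [np.nan, 2.0], "V2": [6.0, np.nan]})

train_cols_with_missing = X_train.columns[X_train.isnull().any()].tolist()
print(train_cols_with_missing)  # ['V1', 'V2']

imp_median = SimpleImputer(strategy="median")
X_train[:] = imp_median.fit_transform(X_train)  # learn medians from train only
X_test[:] = imp_median.transform(X_test)        # reuse them on test (no refitting)
print(X_test)
```
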